Satellite data can be used to estimate the type of landcover at locations around the world. This approach can be an time and cost effective alternative to manually inspecting these locations in person. In this report, we will explore a data set containing satellite data and manually labeled landcover types for locations in Benin. We will use this data to build a model that predicts the landcover type at a location based on the satellite data.
Landcover types:
Built-up: Built-up areas, also known as urban areas, are regions where the landscape is dominated by human-made structures such as buildings, roads, and other infrastructure. These areas are characterized by high-density development and are typically associated with cities, towns, and other developed communities. Built-up areas include residential, commercial, industrial, and institutional land uses.
Cropland: Cropland refers to land that is used for the cultivation of crops. This landcover type includes fields used for growing a variety of crops such as grains, vegetables, fruits, and other agricultural products. Cropland can vary in size from small family farms to large-scale industrial agricultural operations.
Natural Forest: Natural forests are areas covered by trees and other vegetation that have developed through natural processes without significant human intervention. These forests play a crucial role in maintaining biodiversity, regulating climate, and providing habitat for wildlife. Natural forests can be found in a variety of climatic zones, from tropical rainforests to temperate and boreal forests.
Orchard: An orchard is a type of agricultural land where trees or shrubs are cultivated primarily for fruit production. Orchards are typically designed for intensive farming practices, focusing on high yields of specific fruit species such as apples, oranges, cherries, and nuts. These areas require careful management and maintenance to ensure healthy tree growth and abundant fruit production.
The data labeled_points.Rdata contains data on blocks of
land in Benin.
load('data/labeled_points.Rdata')
The file contains two data frames.
1. labeled. The object labeled has 400
locations (with unique identifier ID). The
landcover type at each location has been manually labeled
by a human. Each ID has a unique latitude
(lat) and longitude (lon) and can be thought
of as a pixel in an image.
head(labeled) %>%
as.data.frame()
## ID lat lon landcover
## 1 17394 11.17707 2.297213 builtup
## 2 15545 11.11946 2.853562 builtup
## 3 15722 11.12590 2.940327 builtup
## 4 15489 11.11835 2.854627 builtup
## 5 10946 10.98139 3.283267 builtup
## 6 5208 10.78519 2.806184 builtup
unique(labeled$landcover)
## [1] "builtup" "cropland" "natforest" "orchard"
We see that the four labels are builtup,
cropland, natforest, and
orchard.
2. labeled_train. The object
labeled_train has 3 years of satellite imagery for each
ID. Images were collected every 16 days, and the
year, month, day, and
date for each location are given in the data. These were
all taken by the Landsat 7
satellite. The other columns are
labeled.labeled_train %>%
as.data.frame() %>%
head()
## B1 B2 B3 B4 B5 B6_VCID_1 B6_VCID_2 B7 B8 lat lon year month day
## 1 66 60 73 68 93 146 176 76 69 11.16359 3.413116 2017 1 3
## 2 64 59 69 72 89 147 178 72 69 11.16244 3.414022 2017 1 3
## 3 66 60 74 70 93 148 180 79 72 11.01509 3.416367 2017 1 3
## 4 69 63 85 75 113 151 185 90 75 10.93011 3.553716 2017 1 3
## 5 70 63 79 71 101 149 182 83 73 10.93055 3.553230 2017 1 3
## 6 62 55 66 68 83 148 181 69 67 10.92047 3.486705 2017 1 3
## date ID NDVI NDBI EVI
## 1 2017-01-03 16993 -0.03546099 0.15527950 -1.0416667
## 2 2017-01-03 16894 0.02127660 0.10559006 1.0714286
## 3 2017-01-03 11728 -0.02777778 0.14110429 -0.5000000
## 4 2017-01-03 9560 -0.06250000 0.20212766 -0.3649635
## 5 2017-01-03 9574 -0.05333333 0.17441860 -0.9523810
## 6 2017-01-03 8938 0.01492537 0.09933775 Inf
These band values will be different depending on the landcover type.
The following are known relationships:
We first join our labeled and labeled_train
data sets on ID.
labeled = labeled %>%
select(ID, landcover)
d = labeled_train %>%
left_join(labeled, by = 'ID')
Finally, let’s add a column for vegetation called veg
that is 1 if the landcover is natforest,
orchard, or cropland, and 0 otherwise. We’ll
also add a column for built-up called builtup that is 1 if
the landcover is builtup, and 0 otherwise.
Thus, we are adding indicators for vegetation and built-up areas.
d = d %>%
mutate(veg = ifelse(landcover %in% c('natforest', 'orchard', 'cropland'), 1, 0),
builtup = ifelse(landcover == 'builtup', 1, 0),
EVI = ifelse(is.infinite(EVI), NA, EVI))
glimpse(d)
## Rows: 18,308
## Columns: 22
## $ B1 <dbl> 66, 64, 66, 69, 70, 62, 64, 65, 67, 66, 63, 63, 69, 58, 60, …
## $ B2 <dbl> 60, 59, 60, 63, 63, 55, 60, 59, 57, 58, 55, 57, 63, 50, 49, …
## $ B3 <dbl> 73, 69, 74, 85, 79, 66, 79, 69, 68, 67, 64, 68, 81, 53, 58, …
## $ B4 <dbl> 68, 72, 70, 75, 71, 68, 72, 65, 64, 64, 65, 63, 72, 82, 84, …
## $ B5 <dbl> 93, 89, 93, 113, 101, 83, 100, 86, 88, 79, 78, 80, 97, 69, 7…
## $ B6_VCID_1 <dbl> 146, 147, 148, 151, 149, 148, 150, 149, 148, 149, 148, 150, …
## $ B6_VCID_2 <dbl> 176, 178, 180, 185, 182, 181, 183, 182, 179, 181, 180, 182, …
## $ B7 <dbl> 76, 72, 79, 90, 83, 69, 87, 72, 76, 68, 64, 69, 78, 43, 45, …
## $ B8 <dbl> 69, 69, 72, 75, 73, 67, 74, 65, 66, 65, 63, 65, 77, 68, 72, …
## $ lat <dbl> 11.16359, 11.16244, 11.01509, 10.93011, 10.93055, 10.92047, …
## $ lon <dbl> 3.413116, 3.414022, 3.416367, 3.553716, 3.553230, 3.486705, …
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ date <date> 2017-01-03, 2017-01-03, 2017-01-03, 2017-01-03, 2017-01-03,…
## $ ID <int> 16993, 16894, 11728, 9560, 9574, 8938, 8914, 12410, 12422, 9…
## $ NDVI <dbl> -0.035460993, 0.021276596, -0.027777778, -0.062500000, -0.05…
## $ NDBI <dbl> 0.15527950, 0.10559006, 0.14110429, 0.20212766, 0.17441860, …
## $ EVI <dbl> -1.04166667, 1.07142857, -0.50000000, -0.36496350, -0.952380…
## $ landcover <chr> "builtup", "builtup", "builtup", "builtup", "builtup", "buil…
## $ veg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, …
## $ builtup <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
Now we perform some data exploration to determine the relationship between vegetation and the band values. Also, we will look at the seasonality of the bands (i.e. which bands vary the most among landcover types).
## Create a data set that is one row per location
## with mean(NDVI), mean(B7), and landcover type for each location
dm = d %>%
group_by(ID) %>%
summarise(
B1 = mean(B1, na.rm=T),
B2 = mean(B2, na.rm=T),
B3 = mean(B3, na.rm=T),
B4 = mean(B4, na.rm=T),
B5 = mean(B5, na.rm=T),
B6_VCID_1 = mean(B6_VCID_1, na.rm=T),
B6_VCID_2 = mean(B6_VCID_2, na.rm=T),
B7 = mean(B7, na.rm=T),
NDVI = mean(NDVI, na.rm=T),
NDBI = mean(NDBI, na.rm=T),
EVI = mean(EVI, na.rm=T),
landcover = unique(landcover)) %>%
ungroup() %>%
mutate(veg = ifelse(landcover %in% c('natforest', 'orchard', 'cropland'),
1, 0),
builtup = ifelse(landcover == 'builtup', 1, 0),
EVI = ifelse(is.infinite(EVI), NA, EVI)) %>%
as.data.frame()
## Inspect the resulting data frame
glimpse(dm)
## Rows: 400
## Columns: 15
## $ ID <int> 2043, 2069, 2095, 2100, 2114, 2118, 2164, 2181, 2402, 2404, …
## $ B1 <dbl> 80.43333, 83.48000, 90.95833, 87.91304, 85.92857, 85.92857, …
## $ B2 <dbl> 69.16667, 73.52000, 81.70833, 78.30435, 77.42857, 77.42857, …
## $ B3 <dbl> 73.43333, 78.92000, 95.04167, 87.82609, 91.89286, 91.89286, …
## $ B4 <dbl> 91.36667, 94.56000, 97.08333, 95.30435, 75.10714, 75.10714, …
## $ B5 <dbl> 79.00000, 87.12000, 111.33333, 102.04348, 93.03571, 93.03571…
## $ B6_VCID_1 <dbl> 130.1333, 131.3200, 130.7083, 131.4348, 139.0357, 139.0357, …
## $ B6_VCID_2 <dbl> 147.7333, 150.0000, 148.8333, 149.8696, 164.2500, 164.2500, …
## $ B7 <dbl> 51.30000, 57.92000, 79.20833, 71.13043, 76.96429, 76.96429, …
## $ NDVI <dbl> 0.150671964, 0.134296182, 0.047115152, 0.076682407, -0.09528…
## $ NDBI <dbl> -0.087057210, -0.048786083, 0.054766177, 0.028438968, 0.1067…
## $ EVI <dbl> -1.0751116, 1.1790665, -3.6038950, -0.2102671, -0.8309687, -…
## $ landcover <chr> "orchard", "orchard", "orchard", "orchard", "builtup", "buil…
## $ veg <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ builtup <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
We see there are 400 rows, one for each location, and that each row
has ID, landcover type, and band values for
the corresponding location.
Here is a summary of the mean band values for each landcover type.
dg = dm %>%
select(-landcover, -builtup) %>%
pivot_longer(cols = -c(ID, veg)) %>%
group_by(name, veg) %>%
summarise(mean = mean(value,
na.rm = T)) %>%
mutate(mean = round(mean, 2),
veg = paste0('veg', veg)) %>%
pivot_wider(names_from = veg,
values_from = mean) %>%
mutate(diff = veg1 - veg0)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
dg
## # A tibble: 11 × 4
## # Groups: name [11]
## name veg0 veg1 diff
## <chr> <dbl> <dbl> <dbl>
## 1 B1 91.0 84.8 -6.23
## 2 B2 81.5 74.5 -7.05
## 3 B3 93.4 82.0 -11.3
## 4 B4 84.4 91.2 6.88
## 5 B5 95.2 92.7 -2.42
## 6 B6_VCID_1 134. 132. -1.70
## 7 B6_VCID_2 154. 151. -2.98
## 8 B7 76.4 65 -11.4
## 9 EVI -0.04 -0.39 -0.35
## 10 NDBI 0.06 0 -0.06
## 11 NDVI -0.04 0.09 0.13
Histogram of all bands, separated by veg.
dg = dm %>%
select(-landcover, -builtup) %>%
pivot_longer(cols = -c(ID, veg)) %>%
mutate(veg = factor(veg))
head(dg)
g = ggplot(dg,
aes(x = value,
fill = veg)) +
geom_density(alpha = 0.3) +
facet_wrap(~name,
scales = 'free')
g %>%
pub(type = 'hist',
facet = T,
base_size = 9)
ggsave("img/density_plot_per_band.png", plot = g)
## Saving 7 x 5 in image
## # A tibble: 6 × 4
## ID veg name value
## <int> <fct> <chr> <dbl>
## 1 2043 1 B1 80.4
## 2 2043 1 B2 69.2
## 3 2043 1 B3 73.4
## 4 2043 1 B4 91.4
## 5 2043 1 B5 79
## 6 2043 1 B6_VCID_1 130.
Scatter plot of bands vs veg
ds = dg %>%
mutate(veg = as.numeric(as.character(veg)))
g = ggplot(ds,
aes(x = value,
y = veg)) +
geom_jitter(alpha = 0.5,
height = 0.2,
width = 0,
color=pubblue) +
geom_smooth(color=pubred) +
facet_wrap(~name,
scales = 'free')
g %>%
pub(type = 'scatter',
facet = T,
base_size = 9,
ybreaks = c(0, 0.5, 1))
ggsave("img/scatter_plot_bands_vs_veg.png", plot = g)
It seems most band values have a logistic relationship with
veg.
Let’s have a closer look at NDVI vs veg,
since we know that NDVI is a good indicator of
vegetation.
g = ggplot(dm,
aes(NDVI,
veg))+
geom_jitter(height = 0.1,
width = 0,
alpha = 0.5,
color = pubblue) +
geom_smooth(color = pubred)
g %>% pub()
Corrplot of all bands
library(corrplot)
dcor = dm %>%
select(-ID, -veg, -builtup, -landcover) %>%
mutate(EVI = ifelse(is.infinite(EVI), NA, EVI)) %>%
cor(use = 'pairwise.complete.obs')
dcor %>% round(2)
corrplot(dcor)
## B1 B2 B3 B4 B5 B6_VCID_1 B6_VCID_2 B7 NDVI NDBI
## B1 1.00 0.97 0.85 0.28 0.43 -0.32 -0.30 0.56 -0.55 0.29
## B2 0.97 1.00 0.94 0.35 0.58 -0.20 -0.18 0.69 -0.59 0.38
## B3 0.85 0.94 1.00 0.25 0.76 0.05 0.06 0.87 -0.74 0.61
## B4 0.28 0.35 0.25 1.00 0.38 -0.35 -0.34 0.09 0.46 -0.37
## B5 0.43 0.58 0.76 0.38 1.00 0.32 0.33 0.90 -0.45 0.70
## B6_VCID_1 -0.32 -0.20 0.05 -0.35 0.32 1.00 1.00 0.38 -0.35 0.58
## B6_VCID_2 -0.30 -0.18 0.06 -0.34 0.33 1.00 1.00 0.39 -0.35 0.59
## B7 0.56 0.69 0.87 0.09 0.90 0.38 0.39 1.00 -0.76 0.84
## NDVI -0.55 -0.59 -0.74 0.46 -0.45 -0.35 -0.35 -0.76 1.00 -0.84
## NDBI 0.29 0.38 0.61 -0.37 0.70 0.58 0.59 0.84 -0.84 1.00
## EVI 0.13 0.13 0.12 -0.07 0.01 0.03 0.03 0.07 -0.15 0.09
## EVI
## B1 0.13
## B2 0.13
## B3 0.12
## B4 -0.07
## B5 0.01
## B6_VCID_1 0.03
## B6_VCID_2 0.03
## B7 0.07
## NDVI -0.15
## NDBI 0.09
## EVI 1.00
Pairs plot with points colored by veg. We omit some
bands that are highly correlated with others to make the plot more
readable.
library(GGally)
title = 'Band values and landcover type'
dg = dm %>%
mutate(veg = factor(veg))
bands = c('B1', #'B2', 'B3',
'B4', 'B5',
'B6_VCID_1', #'B6_VCID_2',
'B7', 'NDVI', 'NDBI', 'EVI')
g = ggpairs(dg,
aes(color = veg,
fill = veg,
alpha = 0.1,
shape = '20'),
columns = bands,
diag = list(continuous = pub.density)) +
labs(title = title) +
theme_pub(type = 'pairs',
base_size = 8)
g
ggsave("img/pair_plot_bands.png", plot = g)
We can tell from this plot which variables or pairs of variables will
likely help. If we focus on the line NDVI = 0, we can see
that as NDBI and B7 increase the proportion of
veg decreases. The “boundary line” between
veg = 1 and veg = 0 looks like it has a
negative slope, and there are more veg = 1 above that
boundary line.
These observations will be confirmed when fitting some models.
Bands over time for veg = 1 and
veg = 0.
dd = d %>%
filter(!is.infinite(EVI)) %>%
select(-lat, -lon, -landcover, -builtup) %>%
pivot_longer(cols = c(-ID, -veg, -year,
-month, -day, -date)) %>%
mutate(year.mon = year+month/12) %>%
group_by(veg,
year.mon,
name) %>%
summarise(value = mean(value)) %>%
mutate(veg = factor(veg))
head(dd)
title = "Intensity of bands over time"
g = ggplot(dd,
aes(x = year.mon,
y = value,
color = veg))+
geom_line(linewidth = .75)+
geom_point(size = 1)+
facet_wrap(~name,
scales = 'free_y') +
labs(title = title,
x = 'Date',
y = 'Intensity')
g %>%
pub(type = 'line',
facet = T,
base_size = 10,
xbreaks = c(2018, 2019, 2020),
xlabels = as.character(c(2018, 2019, 2020)))
ggsave("img/line_plot_bands_over_time.png", plot = g)
## # A tibble: 6 × 4
## # Groups: veg, year.mon [1]
## veg year.mon name value
## <fct> <dbl> <chr> <dbl>
## 1 0 2017. B1 66.4
## 2 0 2017. B2 59.9
## 3 0 2017. B3 73.1
## 4 0 2017. B4 66.3
## 5 0 2017. B5 86.2
## 6 0 2017. B6_VCID_1 151.
veg and mean NDVILet’s look at the proportion of veg for different
subsets of NDVI.
We can bin our data using the function cut_interval.
dm = dm %>%
mutate(bin = cut_interval(NDVI,
length = 0.05))
head(dd)
## # A tibble: 6 × 4
## # Groups: veg, year.mon [1]
## veg year.mon name value
## <fct> <dbl> <chr> <dbl>
## 1 0 2017. B1 66.4
## 2 0 2017. B2 59.9
## 3 0 2017. B3 73.1
## 4 0 2017. B4 66.3
## 5 0 2017. B5 86.2
## 6 0 2017. B6_VCID_1 151.
Let’s check the counts of veg in each bin.
bin.means = dm %>%
group_by(bin, veg) %>%
count() %>%
pivot_wider(names_from = veg,
values_from = n,
values_fill = 0) %>%
mutate(n = `0` + `1`,
p = `1`/n)
bin.means
## # A tibble: 8 × 5
## # Groups: bin [8]
## bin `0` `1` n p
## <fct> <int> <int> <int> <dbl>
## 1 [-0.15,-0.1] 3 0 3 0
## 2 (-0.1,-0.05] 32 0 32 0
## 3 (-0.05,0] 51 10 61 0.164
## 4 (0,0.05] 13 63 76 0.829
## 5 (0.05,0.1] 1 122 123 0.992
## 6 (0.1,0.15] 0 72 72 1
## 7 (0.15,0.2] 0 27 27 1
## 8 (0.2,0.25] 0 6 6 1
Let’s plot these observed proportions on a scatter plot
bin.means = bin.means %>%
ungroup() %>% #################### !!!!!!!!!!!!
mutate(mid = seq(-.125, .225, by = 0.05))
g = ggplot(dm,
aes(x = NDVI,
y = veg)) +
geom_jitter(alpha = 0.5,
height = 0.2,
width = 0,
color = pubblue) +
geom_point(data = bin.means,
aes(x = mid,
y = p,
size = n),
color = pubred) +
geom_line(data = bin.means,
aes(x = mid,
y = p),
color = pubred)
g %>%
pub(type = 'scatter',
ybreaks = c(0, 0.5, 1))
Based on this plot, we can see that the proportion of
veg is higher for higher values of NDVI. This
is consistent with the fact that NDVI is a good indicator of vegetation.
We also see that the relationship is not linear, but rather an S-shaped
curve, so a logistic regression model might be a good choice for
modeling this relationship.
Instead of looking at only NDVI, let’s look at all of
the band values and indexes. We could make a scatter plot for each
statistic like we did for NDVI above, but it will be easier
to reorganize the data and use facet_wrap as we have done
before. So we’ll we use the long format of the data ds from
above.
head(ds)
## # A tibble: 6 × 4
## ID veg name value
## <int> <dbl> <chr> <dbl>
## 1 2043 1 B1 80.4
## 2 2043 1 B2 69.2
## 3 2043 1 B3 73.4
## 4 2043 1 B4 91.4
## 5 2043 1 B5 79
## 6 2043 1 B6_VCID_1 130.
We now find the proportion of veg for each bin of each
band value. Here, we specify 10 bins for every band, since it would be
hard to find an ideal number of bins per band.
ds = ds %>%
group_by(name) %>%
mutate(bin = cut_interval(value, n=10))
head(ds)
## # A tibble: 6 × 5
## # Groups: name [6]
## ID veg name value bin
## <int> <dbl> <chr> <dbl> <fct>
## 1 2043 1 B1 80.4 (77,81.3]
## 2 2043 1 B2 69.2 (67.5,72.3]
## 3 2043 1 B3 73.4 (67.1,73.6]
## 4 2043 1 B4 91.4 (87.2,91.4]
## 5 2043 1 B5 79 (76.9,83.7]
## 6 2043 1 B6_VCID_1 130. (129,132]
dp = ds %>%
group_by(name, bin) %>%
summarise(p = mean(veg),
n = n())
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
head(dp)
## # A tibble: 6 × 4
## # Groups: name [1]
## name bin p n
## <chr> <fct> <dbl> <int>
## 1 B1 [68.4,72.7] 1 4
## 2 B1 (72.7,77] 1 27
## 3 B1 (77,81.3] 0.929 56
## 4 B1 (81.3,85.6] 0.946 92
## 5 B1 (85.6,89.9] 0.673 104
## 6 B1 (89.9,94.2] 0.557 79
Let’s find the midpoint of each interval, since we’ll need that for
plotting. Since we don’t want to write down 11 different formulas using
seq like we did above for NDVI to find the
midpoints, we’ll extract the left and right coordinates from the bin
using regular expressions, and use those to compute the midpoint.
dp = dp %>%
mutate(left = gsub(',.+', '', bin),
left = gsub('[(]|[[]', '', left),
right = gsub('.+,|[]]', '', bin),
left = as.numeric(left),
right = as.numeric(right),
mid = left/2 + right/2)
head(dp)
## # A tibble: 6 × 7
## # Groups: name [1]
## name bin p n left right mid
## <chr> <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 B1 [68.4,72.7] 1 4 68.4 72.7 70.6
## 2 B1 (72.7,77] 1 27 72.7 77 74.8
## 3 B1 (77,81.3] 0.929 56 77 81.3 79.2
## 4 B1 (81.3,85.6] 0.946 92 81.3 85.6 83.4
## 5 B1 (85.6,89.9] 0.673 104 85.6 89.9 87.8
## 6 B1 (89.9,94.2] 0.557 79 89.9 94.2 92.1
Let’s now make a plot with the value of the stat on the horizontal
axis, veg on the vertical axis, and let’s use
facet_wrap to make a different window for each stat.
We’ll start with just the scatter plots.
g = ggplot(ds,
aes(x = value,
y = veg))+
geom_jitter(height = 0.1,
width = 0,
alpha = 0.1,
color = pubblue)+
facet_wrap(~name,
scales = 'free_x')
g %>%
pub(facet = T)
Now let’s add proportions.
g2 = g +
geom_point(data = dp,
aes(x = mid,
y = p,
size = n),
color = pubred) +
geom_line(data = dp,
aes(x = mid,
y = p),
color = pubred,
linewidth = .5) +
scale_size(range=c(0.5, 3))
g2 %>%
pub(facet = T)
ggsave("img/scatter_plot_bands_vs_veg_dot.png", plot = g2)
As we would expect, there is a positive relationship between
NDVI and veg but a negative relationship
between NDBI and veg. We can get a rough idea
of the relative strength of these relationships as well by looking at
the steepness of the curve. For example, NDVI appears to
have a very strong relationship to B4 is less strong.
We also see that the S-shaped curve appears a lot, especially if you ignore the left and right extremes where there are few data points. We have sized the red dots using the number of observations in each subset, so the small red dots correspond to the subsets with very few data points.